Efficient Community Detection in Large 
Networks using Content and Links 



Yiye Ruan, David Fuhry, Srinivasan Parthasarathy 
Department of Computer Science and Engineering 
The Ohio State University 
{ruan,fuhry,srini}@cse.ohio-state.edu 

December 4, 2012 



Abstract 

In this paper we discuss a very simple approach of combining content and 
link information in graph structures for the purpose of community discovery, a 
fundamental task in network analysis. Our approach hinges on the basic intuition 
that many networks contain noise in the link structure and that content information 
can help strengthen the community signal. This enables one to eliminate the 
impact of noise (false positives and false negatives), which is particularly prevalent 
in online social networks and Web-scale information networks. 

Specifically we introduce a measure of signal strength between two nodes in 
the network by fusing their link strength with content similarity. Link strength 
is estimated based on whether the link is likely (with high probability) to reside 
within a community. Content similarity is estimated through cosine similarity or 
Jaccard coefficient. We discuss a simple mechanism for fusing content and link 
similarity. We then present a biased edge sampling procedure which retains edges 
that are locally relevant for each graph node. The resulting backbone graph can 
be clustered using standard community discovery algorithms such as Metis and 
Markov clustering. 

Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia 
and CiteSeer) with varying sizes and characteristics, we demonstrate the effective- 
ness and efficiency of our methods over state-of-the-art learning and mining ap- 
proaches several of which also attempt to combine link and content analysis for 
the purposes of community discovery. Specifically we always find a qualitative 
benefit when combining content with link analysis. Additionally our biased graph 
sampling approach realizes a quantitative benefit in that it is typically several or- 
ders of magnitude faster than competing approaches. 



1 Introduction 

An increasing number of applications on the World Wide Web rely on combining link 
and content analysis (in different ways) for subsequent analysis and inference. For 
example, search engines, like Google, Bing and Yahoo! typically use content and link 
information to index, retrieve and rank web pages. Social networking sites like Twitter, 
Flickr and Facebook, as well as the aforementioned search engines, are increasingly 
relying on fusing content (pictures, tags, text) and link information (friends, followers, 
and users) for deriving actionable knowledge (e.g. marketing and advertising). 

In this article we limit our discussion to a fundamental inference problem — that 
of combining link and content information for the purposes of inferring clusters or 
communities of interest. The challenges are manifold. The topological characteristics 
of such problems (graphs induced from the natural link structure) make identifying 
community structure difficult. Further complicating the issue is the presence of noise: 
incorrect links (false positives) and missing links (false negatives). It is also unclear how 
to fuse this link structure with content information efficiently and effectively. 
Finally, underpinning these challenges is the issue of scalability, as many of these 
graphs are extremely large, running into millions of nodes and billions of edges, if not 
larger. 

Given the fundamental nature of this problem, a number of solutions have emerged 
in the literature. Broadly these can be classified as: i) those that ignore content infor- 
mation (a large majority) and focus on addressing the topological and scalability chal- 
lenges, and ii) those that account for both content and topological information. From 
a qualitative standpoint the latter presumes to improve on the former (since the null 
hypothesis is that content should help improve the quality of the inferred communities) 
but often at a prohibitive cost to scalability. 

In this article we present CODICIL¹, a family of highly efficient graph simplification 
algorithms leveraging both content and graph topology to identify and retain important 
edges in a network. Our approach relies on fusing content and topological (link) 
information in a natural manner. The output of CODICIL is a transformed variant of 
the original graph (with content information), which can then be clustered by any fast 
content-insensitive graph clustering algorithm such as METIS or Markov clustering. 
Through extensive experiments on real-world datasets drawn from Flickr, Wikipedia, 
and CiteSeer, and across several graph clustering algorithms, we demonstrate the ef- 
fectiveness and efficiency of our methods. We find that CODICIL runs several orders 
of magnitude faster than those state-of-the-art approaches and often identifies commu- 
nities of comparable or superior quality on these datasets. 

This paper is arranged as follows. In Section 2 we discuss existing research efforts 
pertaining to our work. The CODICIL algorithm, along with implementation details, 
is presented in Section 3. We report quantitative experimental results in Section 4 and 
demonstrate the qualitative benefits brought by CODICIL via case studies in Section 5. 
We conclude the paper in Section 6. 

2 Related Work 

Community Discovery using Topology (and Content): Graph clustering/partitioning 
for community discovery has been studied for more than five decades, and a vast 
number of algorithms (exemplars include Metis [15], Graclus [6] and Markov clustering 
[27]) have been proposed and widely used in fields including social network 
analytics, document clustering, bioinformatics and others. Most of those methods, 
however, discard content information associated with graph elements. Due to space 
limitations, we omit detailed discussion and refer interested readers to recent surveys 
(e.g. [9]) for a more comprehensive picture. Leskovec et al. compared a multitude 
of community discovery algorithms based on conductance score, and discovered the 
trade-off between clustering objective and community compactness [16]. 

¹ COmmunity Discovery Inferred from Content Information and Link-structure 

Various approaches have been taken to utilize content information for community 
discovery. One of them is generative probabilistic modeling which considers both 
contents and links as being dependent on one or more latent variables, and then estimates 
the conditional distributions to find community assignments. PLSA-PHITS [5], 
Community-User-Topic model [29] and Link-PLSA-LDA [20] are three representatives 
in this category. They mainly focus on studies of citation and email communication 
networks. Link-PLSA-LDA, for instance, was motivated by finding latent topics 
in text and citations and assumes different generative processes for citing documents, 
cited documents and the citations themselves. Text generation follows the LDA 
approach, and link creation from a citing document to a cited document is controlled 
by another topic-specific multinomial distribution. 

Yang et al. [28] introduced an alternative discriminative probabilistic model, PCL-DC, 
to incorporate content information in the conditional link model and estimate the 
community membership directly. In this model, link probability between two nodes 
is decided by nodes' popularity as well as community membership, which is in turn 
decided by content terms. A two-stage EM algorithm is proposed to optimize com- 
munity membership probabilities and content weights alternately. Upon convergence, 
each graph node is assigned to the community with maximum membership probability. 

Researchers have also explored ways to augment the underlying network to take 
into account the content information. The SA-Cluster-Inc algorithm proposed by Zhou 
et al. [30], for example, inserts virtual attribute nodes and attribute edges into the graph 
and computes all-pair random walk distances on the new attribute-augmented graph. 
K-means clustering is then used on original graph nodes to assign them to different 
groups. Weights associated with attributes are updated after each k-means iteration 
according to their clustering tendencies. The algorithm iterates until convergence. 

Ester et al. [8] proposed a heuristic algorithm to solve the Connected k-Center 
problem where both connectedness and radius constraints need to be satisfied. The 
complexity of this method is dependent on the longest distance between any pair of 
nodes in the feature space, making it susceptible to outliers. Biologists have studied 
methods [13, 26] to find functional modules using network topology and gene expression 
data. Those methods, however, bear domain-specific assumptions on data and are 
therefore not directly applicable in general. 

Recently Günnemann et al. [12] introduced a subspace clustering algorithm on 
graphs with feature vectors, which shares some similarity with our topic. Although 
their method could run on the full feature space, the search space of their algorithm 
is confined by the intersection, instead of union, of the epsilon-neighborhood and the 
density-based combined cluster. Furthermore, the construction of both neighborhoods 
is sensitive to their multiple parameters. 

While decent performance can be achieved on small and medium graphs using 
those methods, it often comes at the cost of model complexity and lack of scalability. 
Some of them take time proportional to the number of values in each attribute. Others 
take time and space proportional to the number of clusters to find, which is often 
unacceptable. Our method, in contrast, is more lightweight and scalable. 

Clustering/Learning Multiple Graphs: Content-aware clustering is also related to 
multiple-view clustering, as content information and link structure can be treated as 
two views of the data. Strehl and Ghosh [23] discussed three consensus functions 
(cluster-wise similarity partitioning, hyper-graph partitioning and meta-clustering) to 
implement cluster ensembles, in which the availability of each individual view's clustering 
is assumed. Tang et al. [24] proposed a linked matrix factorization method, 
where each graph's adjacency matrix is decomposed into a "characteristic" matrix and 
a common factor matrix shared among all graphs. The purpose of factorization is to 
represent each vertex by a lower-dimensional vector and then cluster the vertices using 
corresponding feature vectors. Their method, while applicable to small-scale problems, 
is not designed for web-scale networks. 

Graph Sampling for Fast Clustering: Graph sampling (also known as "sparsifica- 
tion" or "filtering") has attracted more and more focus in recent years due to the ex- 
plosive growth of network data. If a graph's structure can be preserved using fewer 
nodes and/or edges, community discovery algorithms can obtain similar results using 
less time and memory storage. Maiya and Berger-Wolf [17] introduced an algorithm 
which greedily identifies the node that leads to the greatest expansion in each iteration 
until the user-specified node count is reached. By doing so, an expander-like node- 
induced subgraph is constructed. After clustering the subgraph, the unsampled nodes 
can be labeled by using collective inference or other transductive learning methods. 
This extra post-processing step, however, operates on the original graph as a whole and 
easily becomes the scalability bottleneck on larger networks. 

Satuluri et al. [22] proposed an edge sampling method to preferentially retain edges 
that connect two similar nodes. The localized strategy ensures that edges in the rela- 
tively sparse areas will not be over-pruned. Their method, however, does not consider 
content information either. 

Edge sampling has also been applied to other graph tasks. Karger [14] studied the 
impact of random edge sampling on the original graph's cuts, and proposed randomized 
algorithms to find a graph's minimum cut and maximum flow. Aggarwal et al. [1] proposed 
using edge sampling to maintain structural properties and detect outliers in 
graph streams. The goal of those works, however, is not to preserve community structure in 
graphs. 

3 Methodology 

We begin by defining the notation used in the rest of the paper. Let $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}_t, \mathcal{T})$ 
be an undirected graph with $n$ vertices $\mathcal{V} = \{v_1, \ldots, v_n\}$, edges $\mathcal{E}_t$, and a collection of $n$ 
corresponding term vectors $\mathcal{T} = \{t_1, \ldots, t_n\}$. We use the terms "graph" and "network" 
interchangeably, as well as the terms "vertex" and "node". Elements in each term vector 
$t_i$ are basic content units, which can be single words, tags, n-grams, etc., depending 
on the context of the underlying network. For each graph node $v_i \in \mathcal{V}$, let its term vector 
be $t_i$. 

Our goal is to generate a simplified, edge-sampled graph $\mathcal{G}_{sample} = (\mathcal{V}, \mathcal{E}_{sample})$ 
and then use $\mathcal{G}_{sample}$ to find communities with coherent content and link structure. 
$\mathcal{G}_{sample}$ should possess the following properties: 

• $\mathcal{G}_{sample}$ has the same vertex set as $\mathcal{G}_t$. That is, no node in the network is added 
or removed during the simplification process. 

• $|\mathcal{E}_{sample}| \ll |\mathcal{E}_t|$, as this enables both better runtime performance and lower 
memory usage in the subsequent clustering stage. 

• Informally put, the resultant edge set $\mathcal{E}_{sample}$ would connect node pairs which 
are both structure-wise and content-wise similar. As a result, it is possible for 
our method to add edges which were absent from $\mathcal{E}_t$ since the content similarity 
was overlooked. 



3.1 Key Intuitions 

The main steps of the CODICIL algorithm are: 

1. Create content edges. 

2. Sample the union of content edges and topological edges with bias, retaining 
only edges that are relevant in local neighborhoods. 

3. Partition the simplified graph into clusters. 

The constructed content graph and simplified graph have the same vertices as the 
input graph (vertices are never added or removed), so the essential operations of the 
algorithm are constructing and combining edges and then sampling with bias. Figure 1 
illustrates the work flow of CODICIL. 



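[Figure 1 (diagram): term vectors $\mathcal{T}$ → 1. Create content edges → content edges $\mathcal{E}_c$; 
content edges $\mathcal{E}_c$ and topological edges $\mathcal{E}_t$ → 2. Combine edges → edge union $\mathcal{E}_u$; 
edge union $\mathcal{E}_u$ → 3. Sample edges with bias → edge subset $\mathcal{E}_{sample}$; 
edge subset $\mathcal{E}_{sample}$ and vertices $\mathcal{V}$ → 4. Cluster → clustering $\mathcal{C}$.] 

Figure 1: Work flow of CODICIL 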



From the term vectors $\mathcal{T}$, content edges $\mathcal{E}_c$ are constructed. Those content edges 
and the input topological edges $\mathcal{E}_t$ are combined as $\mathcal{E}_u$, which is then sampled with bias 
to form a smaller edge set $\mathcal{E}_{sample}$ where the most relevant edges are preserved. The 
graph composed of these sampled edges is passed to the graph clustering algorithm, 
which partitions the vertices into a given number of clusters. 



3.2 Basic Framework 

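The pseudo-code of CODICIL is given in Algorithm 1. 

Algorithm 1 CODICIL 

Input: $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}_t, \mathcal{T})$, $k$, $normalize(\cdot)$, $\alpha \in [0, 1]$, $l$, $clusteralgo(\cdot, \cdot)$, $similarity(\cdot, \cdot)$ 
Returns: $\mathcal{C}$ (a disjoint clustering of $\mathcal{V}$) 
 1: \\ Create content edges $\mathcal{E}_c$ 
 2: $\mathcal{E}_c \leftarrow \emptyset$ 
 3: for $i = 1$ to $|\mathcal{V}|$ do 
 4:     foreach $v_j \in TopK(v_i, k, \mathcal{T})$ do 
 5:         $\mathcal{E}_c \leftarrow \mathcal{E}_c \cup (v_i, v_j)$ 
 6:     end for 
 7: end for 
 8: \\ Combine $\mathcal{E}_t$ and $\mathcal{E}_c$. Retain edges with a bias towards locally relevant ones 
 9: $\mathcal{E}_u \leftarrow \mathcal{E}_t \cup \mathcal{E}_c$ 
10: $\mathcal{E}_{sample} \leftarrow \emptyset$ 
11: for $i = 1$ to $|\mathcal{V}|$ do 
12:     \\ $\Gamma_i$ contains $v_i$'s neighbors in the edge union 
13:     $\Gamma_i \leftarrow ngbr(v_i, \mathcal{E}_u)$ 
14:     for $j = 1$ to $|\Gamma_i|$ do $sim^t_{ij} \leftarrow similarity(ngbr(v_i, \mathcal{E}_t), ngbr(\gamma_j, \mathcal{E}_t))$ 
15:     $simnorm^t_i \leftarrow normalize(sim^t_i)$ 
16:     for $j = 1$ to $|\Gamma_i|$ do $sim^c_{ij} \leftarrow similarity(t_i, t_{\gamma_j})$ 
17:     $simnorm^c_i \leftarrow normalize(sim^c_i)$ 
18:     for $j = 1$ to $|\Gamma_i|$ do $sim_{ij} \leftarrow \alpha \cdot simnorm^t_{ij} + (1 - \alpha) \cdot simnorm^c_{ij}$ 
19:     \\ Sort similarity values in descending order. Store the corresponding node IDs in $idx_i$ 
20:     $[val_i, idx_i] \leftarrow descsort(sim_i)$ 
21:     for $j = 1$ to $\lceil \sqrt{|\Gamma_i|} \rceil$ do 
22:         $\mathcal{E}_{sample} \leftarrow \mathcal{E}_{sample} \cup (v_i, v_{idx_{ij}})$ 
23:     end for 
24: end for 
25: $\mathcal{G}_{sample} \leftarrow (\mathcal{V}, \mathcal{E}_{sample})$ 
26: $\mathcal{C} \leftarrow clusteralgo(\mathcal{G}_{sample}, l)$ \\ Partition into $l$ clusters 
27: return $\mathcal{C}$ 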



CODICIL takes as input 1) $\mathcal{G}_t$, the original graph consisting of vertices $\mathcal{V}$, edges 
$\mathcal{E}_t$ and term vectors $\mathcal{T}$, where $t_i$ is the content term vector for vertex $v_i$, $1 \le i \le 
|\mathcal{V}| = |\mathcal{T}|$, 2) $k$, the number of nearest content neighbors to find for each vertex, 3) 
$normalize(\vec{x})$, a function that normalizes a vector $\vec{x}$, 4) $\alpha$, an optional parameter that 
specifies the weights of topology and content similarities, 5) $l$, the number of output 
clusters desired, 6) $clusteralgo(\mathcal{G}, l)$, an algorithm that partitions a graph $\mathcal{G}$ into $l$ 
clusters, and 7) $similarity(x, y)$, a function that computes the similarity between $x$ and $y$. Note that 
any content-insensitive graph clustering algorithm can be plugged into the CODICIL 
framework, providing great flexibility for applications. 
3.2.1 Creating Content Edges 

Lines 2 through 7 detail how content edges are created. For each vertex $v_i$, its $k$ most 
content-similar neighbors are computed². For each of $v_i$'s top-$k$ neighbors $v_j$, an edge 
$(v_i, v_j)$ is added to the content edges $\mathcal{E}_c$. In our experiments we implemented the $TopK$ 
sub-routine by calculating the cosine similarity of $t_i$'s TF-IDF vector and each other 
term vector's TF-IDF vector. For a content unit $c$, its TF-IDF value in a term vector $t_i$ 
is computed as 

$$\text{tf-idf}(c, t_i) = \sqrt{tf(c, t_i)} \cdot \log\left(1 + \frac{|\mathcal{T}|}{\sum_{j=1}^{|\mathcal{T}|} tf(c, t_j)}\right) \tag{1}$$ 

The cosine similarity of two vectors x and y is 

$$\text{cosine}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \cdot \|\vec{y}\|_2} \tag{2}$$ 

The $k$ vertices corresponding to the $k$ highest TF-IDF vector cosine similarity values 
with $v_i$ are selected as the top-$k$ neighbors of $v_i$. 
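
The content-edge step above can be sketched in Python as follows. This is a minimal brute-force illustration and not the authors' implementation: it assumes the term vectors are already TF-IDF weighted and stored as `{term: weight}` dictionaries, and the function names `cosine`, `top_k_neighbors` and `content_edges` are hypothetical.

```python
import heapq
import math


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_y)


def top_k_neighbors(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], excluding i itself."""
    scored = ((cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i)
    return [j for _, j in heapq.nlargest(k, scored)]


def content_edges(vectors, k):
    """Undirected content edges, stored as a set of (smaller, larger) index pairs."""
    edges = set()
    for i in range(len(vectors)):
        for j in top_k_neighbors(vectors, i, k):
            edges.add((min(i, j), max(i, j)))
    return edges
```

Comparing every pair of vectors is quadratic in the number of vertices, so a sketch like this is only practical for small graphs.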



3.2.2 Local Ranking of Edges and Graph Simplification 

Line 9 takes the union of the newly created content edge set $\mathcal{E}_c$ and the original topological 
edge set $\mathcal{E}_t$. In lines 10 through 24, a sampled edge set $\mathcal{E}_{sample}$ is constructed by 
retaining the most relevant edges from the edge union $\mathcal{E}_u$. For each vertex $v_i$, the edges 
to retain are selected from its local neighborhood in $\mathcal{E}_u$ (line 13). We compute the topological 
similarity (line 14) between node $v_i$ and its neighbor $\gamma_j$ as the relative overlap 
of their respective topological neighbor sets, $I = ngbr(v_i, \mathcal{E}_t)$ and $J = ngbr(\gamma_j, \mathcal{E}_t)$, 
using $similarity$ (either cosine similarity as in Equation 2 or Jaccard coefficient as 
defined below): 

$$\text{jaccard}(I, J) = \frac{|I \cap J|}{|I \cup J|} \tag{3}$$ 
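
As a small illustration of the topological similarity, the following Python sketch builds neighbor sets from an undirected edge list and applies the Jaccard coefficient to them. The helper names `neighbor_sets` and `jaccard` are hypothetical, and returning 0.0 for two empty sets is an assumption the text does not address.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Map each vertex to the set of its neighbors in an undirected edge list."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; 0.0 by assumption if both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

For the edge list `[(0, 1), (0, 2), (1, 2), (2, 3)]`, vertices 0 and 1 have neighbor sets {1, 2} and {0, 2}, which share one of three distinct neighbors.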

After the computation of the topological similarity vector $sim^t_i$ finishes, it is normalized 
by $normalize$ (line 15). In our experiments we implemented $normalize$ with 
either zero-one, which simply rescales the vector to $[0, 1]$: 

$$\text{zero-one}(\vec{x}) = \frac{x_i - \min(\vec{x})}{\max(\vec{x}) - \min(\vec{x})} \tag{4}$$ 

or z-norm³, which centers and normalizes values to zero mean and unit variance: 

$$\text{z-norm}(\vec{x}) = \frac{x_i - \mu}{\sigma}, \quad \sigma^2 = \frac{1}{|\vec{x}| - 1} \sum_{i=1}^{|\vec{x}|} (x_i - \mu)^2 \tag{5}$$ 



² Besides the top-$k$ criterion, we also investigated using all-pairs similarity above a given global threshold, but 
this tended to produce highly imbalanced degree distributions. 

³ Montague and Aslam [19] pointed out that z-norm has the advantage of being both shift and scale 
invariant as well as outlier insensitive. They experimentally found it best among six simple combination 
schemes discussed in [10]. 
Likewise, we compute $v_i$'s content similarity to its neighbor $\gamma_j$ by applying $similarity$ 
on term vectors $t_i$ and $t_{\gamma_j}$ and normalize those similarities (lines 16 and 17). The 
topological and content similarities of each edge are then aggregated with the weight 
specified by $\alpha$ (line 18). 
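
The normalization and aggregation steps (lines 15, 17 and 18 of Algorithm 1) can be sketched as below. This is an illustrative sketch under stated assumptions: the names `zero_one`, `z_norm` and `fuse` are hypothetical, z-norm uses the sample standard deviation, and mapping a constant vector to all zeros is a choice made here to avoid division by zero, not something the text specifies.

```python
import statistics


def zero_one(xs):
    """Rescale values linearly to [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # assumption: constant vector maps to zeros
    return [(x - lo) / (hi - lo) for x in xs]


def z_norm(xs):
    """Center to zero mean and scale to unit (sample) variance."""
    if len(xs) < 2:
        return [0.0 for _ in xs]  # assumption: variance is undefined for one value
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # divides by len(xs) - 1
    if sigma == 0.0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def fuse(sim_topo, sim_content, alpha=0.5, normalize=zero_one):
    """Weighted sum of the normalized topological and content similarities."""
    norm_t = normalize(sim_topo)
    norm_c = normalize(sim_content)
    return [alpha * t + (1.0 - alpha) * c for t, c in zip(norm_t, norm_c)]
```

Both input lists are assumed to be aligned, so that position j in each refers to the same neighbor of the vertex being processed.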

In lines 20 through 23, the edges with highest similarity values are retained. As 
stated in our desiderata, we want $|\mathcal{E}_{sample}| \ll |\mathcal{E}_t|$ and therefore need to retain fewer 
than $|\Gamma_i|$ edges. Inspired by [22], we choose to keep $\lceil \sqrt{|\Gamma_i|} \rceil$ edges. This form has the 
following properties: 1) every vertex $v_i$ will be incident to at least one edge, therefore 
the sparsification process does not generate new singletons, 2) concavity and monotonicity 
ensure that larger-degree vertices will retain no fewer edges than smaller-degree vertices, 
and 3) sublinearity ensures that smaller-degree vertices will have a larger fraction 
of their edges retained than larger-degree vertices. 
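
A minimal Python sketch of this per-vertex retention rule follows. The function name `retain_top_edges` and its argument layout are hypothetical; `scores[j]` is assumed to be the fused similarity of the edge to `neighbor_ids[j]`.

```python
import math


def retain_top_edges(vertex, neighbor_ids, scores):
    """Keep the ceil(sqrt(d)) highest-scoring edges of a vertex with d neighbors."""
    keep = math.ceil(math.sqrt(len(neighbor_ids)))
    ranked = sorted(zip(scores, neighbor_ids), reverse=True)
    # Store each undirected edge as a (smaller, larger) pair.
    return {(min(vertex, nb), max(vertex, nb)) for _, nb in ranked[:keep]}
```

For example, a vertex with 4 neighbors keeps 2 edges (half of them), while a vertex with 100 neighbors keeps 10 (one tenth), which illustrates the sublinearity property described above.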

3.2.3 Partitioning the Sampled Graph 

Finally, in lines 25 through 27, the sampled graph $\mathcal{G}_{sample}$ is formed with the retained 
edges, and the graph clustering algorithm $clusteralgo$ partitions $\mathcal{G}_{sample}$ into $l$ clusters. 

3.2.4 Extension to Support Complex Graphs 

The proposed CODICIL framework can also be easily extended to support community 
detection on other types of graphs. If an input graph has weighted edges, we can 
modify the formula in line 18 so that $sim_{ij}$ becomes the product of the combined similarity 
and the original edge weight. Support for attributed graphs is also straightforward, as the attribute 
assignment of a node can be represented by an indicator vector, which has the same 
form as a text vector. 

3.3 Key Speedup Optimizations 
3.3.1 TopK Implementation 

When computing cosine similarities across term vectors $t_1, \ldots, t_{|\mathcal{T}|}$, one can truncate 
the TF-IDF vectors by only keeping the $m$ elements with the highest TF-IDF values and 
setting other elements to 0. When $m$ is set to a small value, TF-IDF vectors are sparser and 
therefore the similarity calculation becomes more efficient with little loss in accuracy. 
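
The truncation described above amounts to a few lines of Python. The sketch assumes sparse vectors stored as `{term: weight}` dictionaries, where dropped entries are implicit zeros; the function name `truncate` is hypothetical.

```python
import heapq


def truncate(vector, m):
    """Keep the m entries with the largest weights; all others become implicit zeros."""
    return dict(heapq.nlargest(m, vector.items(), key=lambda item: item[1]))
```

If a vector has fewer than `m` non-zero entries it is returned unchanged.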

We may also be interested in constraining content edges to be within a topological 
neighborhood of each node $v_i$, such that the search space of the $TopK$ algorithm can 
be greatly reduced. Two straightforward choices are 1) "1-hop" graph, in which the 
content edges from $v_i$ are restricted to be in $v_i$'s direct topological neighborhood, and 
2) "2-hop" graph, in which content edges can connect $v_i$ and its neighbors' neighbors. 
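
One way to generate the restricted candidate sets is a bounded breadth-first expansion, sketched below. It assumes the topology is given as a dictionary mapping each vertex to its neighbor set; the function name `candidates` is hypothetical. The 2-hop variant here includes direct neighbors as well as neighbors' neighbors.

```python
def candidates(adj, v, hops=1):
    """Vertices within `hops` topological steps of v, with v itself excluded."""
    frontier, seen = {v}, {v}
    for _ in range(hops):
        # Expand one step, skipping vertices that were already reached.
        frontier = {w for u in frontier for w in adj.get(u, ()) if w not in seen}
        seen |= frontier
    seen.discard(v)
    return seen
```

The top-k content search for a vertex would then score only the vertices returned by `candidates` instead of the whole vertex set.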

Many contemporary text search systems make use of inverted indices to speed up 
the operation of finding the $k$ term vectors (documents) with the largest values of Equation 2 
given a query vector $t_i$. We used the implementation from Apache Lucene for 
the largest dataset. 
3.3.2 Fast Jaccard Similarity Estimation 

To avoid expensive computation of the exact Jaccard similarity, we estimate it by using 
minwise hashing [3]. An unbiased estimator of sets $A$ and $B$'s Jaccard similarity can 
be obtained by 

$$\widehat{\text{jaccard}}(A, B) = \frac{1}{h} \sum_{i=1}^{h} I\left[\min(\pi_i(A)) = \min(\pi_i(B))\right] \tag{6}$$ 

where $\pi_1, \pi_2, \ldots, \pi_h$ are $h$ permutations drawn randomly from a family of minwise 
independent permutations defined on the universe $A$ and $B$ belong to, and $I$ is the 
indicator function. After hashing each element once using each permutation, the cost 
for similarity estimation is only $O(h)$, where $h$ is usually chosen to be less than $|A|$ and 
$|B|$. 
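
The estimator in Equation 6 can be sketched in Python as follows. As an assumption made for this illustration, the random permutations are approximated by linear hash functions `(a * x + b) mod p` over integer element IDs, which are only approximately minwise independent; the names `make_hashes`, `signature` and `estimate_jaccard` are hypothetical.

```python
import random

# Mersenne prime 2^31 - 1; element IDs are assumed to be non-negative and smaller.
_PRIME = 2_147_483_647


def make_hashes(h, seed=0):
    """Draw h random (a, b) pairs defining the hash x -> (a * x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(h)]


def signature(items, hashes):
    """Minimum hashed value of a non-empty set of integer IDs under each hash."""
    return [min((a * x + b) % _PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Signatures are computed once per set; every later comparison then costs time linear in the number of hashes rather than in the set sizes.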



3.3.3 Fast Cosine Similarity Estimation 

Similar to the Jaccard coefficient, we can apply the random projection method for fast estimation 
of cosine similarity [4]. In this method, each hash signature for a $d$-dimensional vector 
$\vec{x}$ is $h(\vec{x}) = \text{sgn}(\vec{x} \cdot \vec{r})$, where $\vec{r} \in \{0, 1\}^d$ is drawn randomly. For two vectors $\vec{x}$ and 
$\vec{y}$, the following holds: 

$$\Pr[h(\vec{x}) = h(\vec{y})] = 1 - \frac{\arccos(\text{cosine}(\vec{x}, \vec{y}))}{\pi} \tag{7}$$ 
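
Equation 7 can be inverted to estimate the cosine from the fraction of agreeing signature bits, as in the Python sketch below. Note one deliberate assumption: the sketch draws each projection vector from a standard Gaussian distribution, the usual choice in the sign random projection literature, rather than from the set stated in the text. The function names are hypothetical.

```python
import math
import random


def make_projections(d, h, seed=0):
    """Draw h random d-dimensional projection vectors (Gaussian entries, by assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h)]


def sign_bits(x, projections):
    """One bit per projection: True if the dot product with x is non-negative."""
    return [sum(xi * ri for xi, ri in zip(x, r)) >= 0.0 for r in projections]


def estimate_cosine(bits_x, bits_y):
    """Invert Equation 7: the angle is pi * (1 - agreement), then take its cosine."""
    agree = sum(1 for a, b in zip(bits_x, bits_y) if a == b) / len(bits_x)
    return math.cos(math.pi * (1.0 - agree))
```

With more projections the agreement fraction concentrates around its expectation, so the estimate becomes more accurate at the price of longer signatures.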



3.4 Performance Analysis 

Lines 3-7 of CODICIL are a preprocessing step which computes for each vertex its top-$k$ 
most similar vertices. Results of this one-time computation can be reused for any 
$k' < k$. Its complexity depends on the implementation of the $TopK$ operation. On our 
largest dataset, Wikipedia, this step completed within a few hours. 

We now consider the loop in lines 11-24, where CODICIL loops through each vertex. 
For lines 14 and 16 we use the Jaccard estimator from Section 3.3.2, which runs 
in $O(h)$ with a constant number of hashes $h$. The normalizations in lines 15 and 17 
are $O(|\Gamma_i|)$ and the inner loop in lines 21-23 is $O(\sqrt{|\Gamma_i|})$. Sorting edges by weight in 
line 20 is $O(|\Gamma_i| \log |\Gamma_i|)$. The size of $\Gamma_i$, the union of topology and content neighbors, 
is at most $n$ but on average much smaller in real world graphs. Thus the loop in lines 
11-24 runs in $O(n^2 \log n)$. 

The overall runtime of CODICIL is the edge preprocessing time, plus $O(n^2 \log n)$ 
for the loop, plus the algorithm-dependent time taken by $clusteralgo$. 



4 Experiments 

We are interested in empirically answering the following questions: 

• Do the proposed content-aware clustering methods lead to better clustering 
than using graph topology only? 

• How do our methods compare to existing content-aware clustering methods? 

• How scalable are our methods when the data size grows? 

4.1 Datasets 

Three publicly available datasets with varying scales and characteristics are used. Their 
domains cover document networks as well as social networks. Each dataset is described 
below, and Table 1 follows, listing their basic statistics. 

4.1.1 CiteSeer 

A citation network of computer science publications⁴, each of which is labeled as one of 
six sub-fields. In our graph, nodes stand for publications and undirected edges indicate 
citation relationships. The content information is stemmed words from research 
papers, represented as one binary vector for each document. Observe that the density 
of this network (average degree 2.74) is significantly lower than normally expected for 
a citation network. 

4.1.2 Wikipedia 

The static dump of English Wikipedia pages (October 2011). Only regular pages belonging 
to at least one category are included, each of which becomes one node. Page 
links are extracted. Cleaned bi-grams from title and text are used to represent each 
document's content. We use categories that a page belongs to as the page's class la- 
bels. Note that a page can be contained in more than one category, thus ground truth 
categories are overlapping. 

4.1.3 Flickr 

From a dataset of tagged photos⁵, we removed infrequent tags and users associated with 
only a few tags. Each graph node stands for a user, and an edge exists if one user is in 
another's contact list. Tags that users added to uploaded photos are used as content 
information. Flickr user groups are collected as ground truth. Similar to Wikipedia 
categories, Flickr user groups are also overlapping. 





Dataset     |V|         |E_t|         # CC   |CCmax|     # Uniq. Content Unit   Avg |t_i|   # Class 
Wikipedia   3,580,013   162,085,383   10     3,579,995   1,459,335              202         595,355 
Flickr      16,710      716,063       4      16,704      1,156                  44          184,334 
CiteSeer    3,312       4,536         438    2,110       3,703                  32          6 

Table 1: Basic statistics of datasets. # CC: number of connected components. 
|CCmax|: size of the largest connected component. Avg |t_i|: average number of non-zero 
elements in term vectors. # Class: number of (overlapping) ground truth classes. 



⁴ http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html 
⁵ http://staff.science.uva.nl/~xirong/index.php?n=DataSet.Flickr3m 
4.2 Baseline Methods 

In terms of strawman methods, we compare the CODICIL methods with three existing 
content-aware graph clustering algorithms, SA-Cluster-Inc [30], PCL-DC [28] and 
Link-PLSA-LDA (L-P-LDA) [20]. Their methodologies have been briefly introduced 
in Section 2. When applying SA-Cluster-Inc, we treat each term in $\mathcal{T}$ as a binary-valued 
attribute, i.e. for each graph node $i$ every attribute value indicates whether the 
corresponding term is present in $t_i$ or not. For L-P-LDA, since it does not assume a 
distinct distribution over topics for each cited document individually, only citing documents' 
topic distributions are estimated. As a result, there are 2313 citing documents 
in the CiteSeer dataset and we report the F-score on those documents using their corresponding 
ground-truth assignments. 

Previously SA-Cluster-Inc has been shown to outperform k-SNAP [25], and PCL-DC 
to outperform methods including PLSA-PHITS, LDA-Link-Word [7] and 
Link-Content-Factorization [31]. Therefore we do not compare with those algorithms. 

Two content-insensitive clustering algorithms are included in the experiments as 
well. The first method, "Original Topo", clusters the original network directly. The 
second method samples edges solely based on structural similarity and then clusters 
the sampled graph [22], and we refer to it as "Sampled Topo" hereafter. 

Finally, we also adapt the LDA and K-means⁶ algorithms to cluster graph nodes using 
content information only. When applying LDA, we treat each term vector $t_i$ as 
a document, and one product of LDA's estimation procedure is the distribution over 
latent topics, $\theta_{t_i}$, for each $t_i$ (more details can be found in the original paper by Blei et 
al. [2]). Therefore, we treat each latent topic as a cluster and assign each graph node to 
the cluster that corresponds to the topic of largest probability. We use GibbsLDA++⁷, 
a C++ implementation of LDA using Gibbs sampling [11], which is faster than the 
variational method proposed originally. Results of this method are denoted as "LDA". 

4.3 Experiment Setup 
4.3.1 Parameter Selection 

There are several tunable parameters in the CODICIL framework, the first of which is $k$, 
the number of content neighbors in the $TopK$ sub-routine. We propose the following 
heuristic to decide a proper value for $k$: the value of $k$ should let $|\mathcal{E}_c| \approx |\mathcal{E}_t|$. As a result, 
$k$ is set to 50 for both Wikipedia ($|\mathcal{E}_c| = 150{,}955{,}014$) and Flickr ($|\mathcal{E}_c| = 722{,}928$). 
For CiteSeer, we experiment with two relatively higher $k$ values (50, $|\mathcal{E}_c| = 103{,}080$ 
and 70, $|\mathcal{E}_c| = 143{,}575$) in order to compensate for the extreme sparsity of the original 
network. Though simplistic, this heuristic leads to decent clustering quality, as shown 
in Section 4.5, and avoids extra effort for tuning. 

Another parameter of interest is $\alpha$, which determines the weights for structural and 
content similarities. We set $\alpha$ to 0.5 unless otherwise specified, as in Section 4.7. The 
number of hashes ($h$) used for minwise hashing (Jaccard coefficient) is 30, and 512 for 
random projection (cosine similarity). Experiments with both choices of similarity 
function are performed. As for $m$, the number of non-zero elements in term vectors, 
we let $m = 10$ for Wikipedia and Flickr. This optional step is omitted for CiteSeer 
since the speedup is insignificant. 

⁶ We do not report the running time of K-means as it is not implemented in C or C++. 
⁷ http://gibbslda.sourceforge.net/ 



4.3.2 Clustering Algorithm 

We combine the CODICIL framework with two different clustering algorithms, Metis⁸ [15] 
and Multi-level Regularized Markov Clustering (MLR-MCL)⁹ [21]. Both clustering 
algorithms are also applied to the strawman methods. 



4.4 Effect of Simplification on Graph Structure 

In this section we investigate the impact of topological simplification (or sampling) 
on the spectrum of the graph. For both CiteSeer and Flickr (results for Wikipedia are 
similar to those for Flickr) we compute the Laplacian of the graph and then examine the 
top part of its eigenspectrum (first 2000 eigenvectors). Specifically, in Figure 2 we 
order the eigenvectors from the smallest one to the largest one (on the X axis) and plot 
corresponding eigenvalues (on the Y axis). 
[Figure 2 (plots): eigenvalues (Y axis) against eigenvectors ordered from smallest to largest 
(X axis), before and after simplification. Panels: (a) CiteSeer, (b) Flickr.] 

Figure 2: Eigenvalues of graph Laplacian before and after simplification 

The multiplicity of 0 as an eigenvalue in such a plot corresponds to the number 
of connected components within the graph [18]. For CiteSeer we see an increase in 
the number of components as a result of topological simplification, whereas for Flickr 
(similarly for Wikipedia) the number of components is unchanged. Our hypothesis is 
that for datasets like CiteSeer this will have a negative impact on the quality of the 
resulting clustering. We further hypothesize that our content-based enhancements will 
help in overcoming this shortfall. 

Note that the sum of eigenvalues for the complete spectrum is proportional to the
number of edges in the graph [18]. This explains why the plots for the original graphs
are slightly above those for the simplified graphs, even though the overall trends (e.g.
spectral gap, relative changes in eigenvalues), except for the number of components,
are quite similar for both datasets.
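The two spectral facts used above can be checked on a toy graph. The sketch below assumes the unnormalized Laplacian L = D - A (the text does not say which variant is used) and a hypothetical six-node graph made of two triangles joined by one bridge edge; dropping the bridge plays the role of simplification.

```python
import numpy as np


def laplacian(num_nodes, edges):
    """Unnormalized Laplacian L = D - A of an undirected, unweighted graph."""
    lap = np.zeros((num_nodes, num_nodes))
    for u, v in edges:
        lap[u, u] += 1
        lap[v, v] += 1
        lap[u, v] -= 1
        lap[v, u] -= 1
    return lap


def count_components(eigenvalues, tol=1e-8):
    """Multiplicity of eigenvalue 0 (up to a numerical tolerance)."""
    return int(np.sum(np.abs(eigenvalues) < tol))


# Hypothetical toy graph: two triangles connected by the bridge (2, 3).
edges_original = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
# "Simplified" version in which the bridge has been sampled away.
edges_simplified = [e for e in edges_original if e != (2, 3)]

eig_original = np.linalg.eigvalsh(laplacian(6, edges_original))
eig_simplified = np.linalg.eigvalsh(laplacian(6, edges_simplified))

components_original = count_components(eig_original)
components_simplified = count_components(eig_simplified)

# The trace of L is the sum of degrees, so the eigenvalues sum to 2 * |E|.
sum_original = eig_original.sum()
sum_simplified = eig_simplified.sum()
```

Removing the bridge raises the multiplicity of the zero eigenvalue from one to two and lowers the eigenvalue sum by exactly two, which mirrors the behavior described for CiteSeer on a much smaller scale.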



http://glaros.dtc.umn.edu/gkhome/metis/metis/download

http://www.cse.ohio-state.edu/~satuluri/research.html

4.5 Clustering Quality



We are interested in comparing the predicted clustering with the real community
structure, since group/category information is available for all three datasets.
Later in Section 5 we will evaluate CODICIL's performance qualitatively. While it
is tempting to use conductance or other cut-based objectives to evaluate the quality
of clustering, they only value the structural cohesiveness but not the content
cohesiveness of the resultant clustering, which is exactly the motivation for
content-aware clustering algorithms. Instead, we use average F-score with regard to the
ground truth as the clustering quality measure, as it takes content grouping into
consideration and ensures a fair comparison among different clusterings. Given a
predicted cluster p and with reference to a ground truth cluster g (both in the form of
node sets), we define the precision rate as $\frac{|p \cap g|}{|p|}$ and the recall rate
as $\frac{|p \cap g|}{|g|}$. The F-score of p on g, denoted as F(p, g), is
the harmonic mean of precision and recall rates.

For a predicted cluster p, we compute its F-score on each g in the ground truth 
clustering G and define the maximal obtained as p's F-score on G. That is: 

$F(p, G) = \max_{g \in G} F(p, g)$ . (8)

The final F-score of the predicted clustering P on the ground truth clustering G 
is then calculated as the weighted (by cluster size) average of each predicted cluster's 
F-score: 

$F(P, G) = \sum_{p \in P} \frac{|p|}{|V|} F(p, G)$ . (9)

This effectively penalizes a predicted clustering that is not well-aligned with the
ground truth, and we use it as the quality measure of all methods on all datasets.
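The quality measure of Equations (8) and (9) can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: clusters are plain Python sets, the function names are made up for this example, and the weights are normalized by the total size of the predicted clusters, which equals the number of nodes when the prediction is a partition.

```python
def f_score(p, g):
    """Harmonic mean of precision |p & g| / |p| and recall |p & g| / |g|."""
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def cluster_f_score(p, ground_truth):
    """Equation (8): best F-score of p over all ground truth clusters."""
    return max(f_score(p, g) for g in ground_truth)


def average_f_score(predicted, ground_truth):
    """Equation (9): size-weighted average over the predicted clusters."""
    total = sum(len(p) for p in predicted)
    return sum(len(p) / total * cluster_f_score(p, ground_truth)
               for p in predicted)


# Toy usage with six nodes, two predicted and two ground truth clusters.
predicted = [{1, 2, 3}, {4, 5, 6}]
ground_truth = [{1, 2, 3, 4}, {5, 6}]
score = average_f_score(predicted, ground_truth)
```

In the toy example the first predicted cluster matches the first ground truth cluster best (F = 6/7) and the second matches the second (F = 0.8), so the weighted average is 0.5 * 6/7 + 0.5 * 0.8.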

4.5.1 CiteSeer 

In Figure 3 we show the experiment results on CiteSeer. Since it is known that the
network has six communities (i.e. sub-fields in computer science), there is no need
to vary l, the number of desired clusters. We report results using Metis (similar
numbers were observed with Markov clustering). For PCL-DC, we set the parameter λ
to 5 as suggested in the original paper, yielding an F-score of 0.570. The F-scores of
SA-Cluster-Inc and L-P-LDA are 0.348 and 0.458, respectively. As we can see clearly
in the bar chart, clustering based on topology alone results in a performance well
below the state-of-the-art content-aware clustering methods. This is not surprising as
the input graph has 438 connected components and therefore most small components
were randomly assigned a prediction label. Although such an approach has no impact on
topology-based measures (e.g. normalized cut or conductance), it greatly spoils the
F-score measure against the ground truth. Neither is LDA able to provide a competitive
result, as it is oblivious to the link structure embedded in the dataset. Surprisingly
though, K-means only manages to produce a very unbalanced clustering (the largest
cluster always contains more than 90% of all papers) even after 50 iterations, and its
F-score (averaged over five runs) is only 0.336.

Figure 3: F-score of Metis on CiteSeer

On the other hand, our content-aware approaches (using Metis as the clustering
method) were able to handle the issue of disconnection as they also include
content-similar edges. For both similarity measures, the F-scores are within 90% of
PCL-DC's, and CODICIL outperforms PCL-DC when k increases to 70.

While achieving quality that is comparable with existing methods, the CODICIL
series are significantly faster. PCL-DC takes 234 seconds on this dataset and
SA-Cluster-Inc requires 306 seconds. LDA finishes in 40 seconds. In contrast, the sum
of CODICIL's edge sampling and clustering time never exceeds 1 second. Therefore,
the CODICIL methods are at least one order of magnitude faster than state-of-the-art
algorithms.

4.5.2 Wikipedia 

For the Wikipedia dataset, we were unable to run the experiment on SA-Cluster-Inc,
PCL-DC, L-P-LDA, LDA and K-means as their memory and/or running time requirements
became prohibitive on this million-node network. For example, storing 10,000
centroids alone in K-means requires 54 GB.

Figures 4a and 4c plot the performance using MLR-MCL and Metis, respectively.
Since category assignments as the ground truth are overlapping, there is no gold
standard for the number of clusters. We therefore varied l in both clustering algorithms.
Our content-aware clustering algorithms consistently outperform Sampled Topo by a
large margin, indicating that CODICIL methods are able to simplify the network and
recover community structure at the same time. CODICIL methods' F-scores are also
on par with or better than those of Original Topo.

4.5.3 Flickr 

Figure 5a shows the performance of various methods with MLR-MCL on Flickr,
where SA-Cluster-Inc, PCL-DC, LDA and K-means can also finish in a reasonable
time (L-P-LDA still takes more than 30 hours). Again, l was varied for the clustering
algorithm. Similar to the results on CiteSeer, CODICIL methods again lead the baselines
by a considerable margin. The F-scores of SA-Cluster-Inc, LDA, and K-means never
exceed 0.2, whereas CODICIL methods' F-scores are often higher, together with those of
Original & Sampled Topo.

Figure 4: Experiment Results on Wikipedia

Readers may have noticed that for PCL-DC only three data points (l = 50, 75, 100)
are obtained. That is because its excessive memory consumption crashed our
workstation after using up 16 GB of RAM for larger l values. We also observe that while
PCL-DC generates a group membership distribution over l groups for each vertex,
fewer than l communities are discovered. That is, there exist groups of which no
vertex is a prominent member. Furthermore, the number of communities discovered
decreases as l increases (45, 43 and 39 communities for l = 50, 75, 100), which is
opposite to other methods' trends. All three clusterings' F-scores are less than 0.25.
Similarly, multiple runs of K-means (K is set to 400, 800, 1200, and 1600) can only
identify roughly 200 communities.
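To see how a soft clustering over l groups can yield fewer than l communities, consider the following sketch. It assumes that each vertex is assigned to the group with the largest membership probability; the text does not spell out how a "prominent member" is determined, so this hard-assignment rule and the toy membership matrix are assumptions made for illustration only.

```python
import numpy as np


def discovered_communities(membership):
    """Groups that are the most probable group of at least one vertex.

    membership is an (n_vertices, n_groups) array whose rows are
    membership distributions over the groups.
    """
    hard_assignment = np.argmax(membership, axis=1)
    return np.unique(hard_assignment)


# Hypothetical distributions of four vertices over l = 3 groups.
membership = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.5, 0.4],
])
found = discovered_communities(membership)
num_found = len(found)
```

Here the third group never wins the arg max for any vertex, so only two of the three requested groups surface as communities.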

4.6 Scalability 

The running time on CiteSeer has already been discussed, and here we focus on Flickr
and Wikipedia. For CODICIL methods, the running time includes both the edge sampling
and clustering stages. The plots' Y-axes (running time) are in log scale.

Figure 5: Experiment Results on Flickr



4.6.1 Flickr 

We first report scalability results on Flickr (see Figure 5b). For SA-Cluster-Inc, the
value of l (the desired output cluster count), ranging from 100 to 5000, does not affect
its running time, which always stays between 1 and 1.25 hours with memory usage around
12 GB. The running time of LDA appears, to a large extent, linear in the number of
latent topics (i.e. l) specified, climbing from 2.56 hours (l = 200) to 15.88 hours
(l = 1600). For PCL-DC, the running time with three l values (50, 75, 100) is 0.5, 2.0
and 2.8 hours, respectively.

As for our content-aware clustering algorithms, running them on Flickr requires 
less than 8 seconds, which is three to four orders of magnitude faster than SA-Cluster- 
Inc, PCL-DC and LDA. Original Topo takes more than 10 seconds, and Sampled Topo 
runs slightly faster than CODICIL methods. 

4.6.2 Wikipedia 

Original Topo, Sampled Topo and all CODICIL methods finished successfully. The
running time is plotted in Figures 4b and 4d. When clustering using MLR-MCL, our
methods are at least one order of magnitude faster than clustering based on network
topology alone. For Metis, CODICIL is also more than four times faster. The trend
lines suggest our methods have promising scalability for analysis on even larger
networks.



4.7 Effect of Varying α on F-score

So far all experiments fix α at 0.5, meaning equal weights for structural and
content similarities. In this sub-section we track how the clustering quality changes
when the value of α is varied from 0.1 to 0.9 with a step length of 0.1.
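The role of the weight can be illustrated with a small sweep. The sketch below assumes that the fused similarity is the convex combination α · (structural similarity) + (1 − α) · (content similarity), which is consistent with the description of α here but is not restated in this section; the two candidate edges and their similarity values are hypothetical.

```python
def fuse(topo_sim, content_sim, alpha):
    """Assumed fusion rule: convex combination weighted by alpha."""
    return alpha * topo_sim + (1.0 - alpha) * content_sim


# Weights 0.1, 0.2, ..., 0.9, matching the step length used in the text.
alphas = [round(0.1 * i, 1) for i in range(1, 10)]

# Two hypothetical candidate edges of the same node, with normalized scores.
link_heavy = {"topo_sim": 0.9, "content_sim": 0.1}
content_heavy = {"topo_sim": 0.3, "content_sim": 0.8}

# Record which edge would be ranked first for each value of alpha.
preferred = {}
for alpha in alphas:
    score_link = fuse(alpha=alpha, **link_heavy)
    score_content = fuse(alpha=alpha, **content_heavy)
    preferred[alpha] = "link_heavy" if score_link > score_content else "content_heavy"
```

With these toy numbers the content-heavy edge is ranked first for α up to 0.5 and the link-heavy edge from α = 0.6 onward, which shows how the choice of α shifts which edges are likely to survive sampling.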

On Wikipedia (Figure 6a) and CiteSeer (Figure 6b), F-scores are greatest around
α ≈ 0.5, supporting the decision of assigning equal weights to structural and content
similarities. Results differ on Flickr, where the F-score consistently improves as α
increases (i.e. more weight assigned to topological similarity).

Figure 6: Effect of Varying α on F-score (Avg. # Clusters for Wikipedia: 29,414, Avg.
# Clusters for Flickr: 1,911)

4.8 Effect of E_c Constraint on F-score

In Section 3.3.1 we discuss the possibility of constraining content edges within a
topological neighborhood for each node v_i. Here we provide a brief review of how the
quality of the resultant clusterings is impacted by such a constraint. For the sake of
space, we focus on the F-scores on Wikipedia and Flickr.

Figures 7a and 7b show F-scores achieved on Wikipedia, using different E_c
constraints. Full means no constraint and thus the TopK sub-routine searches the whole
vertex set V, whereas 1-hop constrains the search to within a one-hop neighborhood, and
likewise for 2-hop. The plots of full and 2-hop almost overlap with each other,
suggesting that searching within the 2-hop neighborhood can provide sufficiently strong
content signals on this dataset. For Flickr (Figures 7c and 7d), interestingly 2-hop and
1-hop have a slight lead over full. This may be an indication that in online social
networks, compared with information networks, content similarity between two closely
connected users emits stronger community signals.
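The difference between the full, 1-hop and 2-hop settings can be sketched as a restriction on the candidate set of a top-k search. The code below is an illustration under stated assumptions, not the TopK sub-routine of the paper: it uses exact cosine similarity over sparse term dictionaries instead of hashing, and the names `within_hops` and `top_k` as well as the toy path graph are made up for this example.

```python
import math
from collections import deque


def within_hops(adj, source, max_hops):
    """Nodes reachable from source in at most max_hops edges (source excluded)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the allowed radius
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    seen.discard(source)
    return seen


def cosine(u, v):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(adj, terms, source, k, max_hops=None):
    """k most content-similar nodes; max_hops=None searches the whole vertex set."""
    if max_hops is None:
        candidates = set(adj) - {source}
    else:
        candidates = within_hops(adj, source, max_hops)
    ranked = sorted(candidates,
                    key=lambda v: (-cosine(terms[source], terms[v]), v))
    return ranked[:k]


# Toy path graph 0 - 1 - 2 - 3; node 3 has the same content as node 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
terms = {
    0: {"graph": 1.0, "cut": 1.0},
    1: {"graph": 1.0},
    2: {"photo": 1.0, "cut": 0.5},
    3: {"graph": 1.0, "cut": 1.0},
}
full_result = top_k(adj, terms, source=0, k=1)
one_hop_result = top_k(adj, terms, source=0, k=1, max_hops=1)
two_hop_result = top_k(adj, terms, source=0, k=2, max_hops=2)
```

In the toy graph the unconstrained search picks the distant but identical node 3, whereas the 1-hop search can only return the direct neighbor. Restricting the radius also shrinks the candidate set, which is the practical appeal of the constraint on large graphs.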

4.9 Discussions 

An interesting observation on the biased edge sampling is that it always results in an 
improvement in running time. However, sampling just the topology graph results in a 
clear loss in accuracy whereas content-conscious sampling is much more effective with 
accuracies that are on par with the best performing methods at a fraction of the cost to 
compute. We observe this for all three datasets. 

We also find that for probabilistic-model-based methods (PCL-DC, L-P-LDA and
LDA) as well as K-means, the running time is at least linear in l, the desired number of
output clusters, which becomes a critical drawback in the face of large-scale workloads.
As the network grows, the number of clusters also increases naturally. Plots of CODICIL
methods' running time, on the other hand, suggest a logarithmic increase with regard
to the number of clusters, which is more affordable.



(a) Varying E_c Constraint, MLR-MCL on Wikipedia (b) Varying E_c Constraint, METIS on Wikipedia
(c) Varying E_c Constraint, MLR-MCL on Flickr (d) Varying E_c Constraint, METIS on Flickr

X axis of each panel: Num of Output Clusters. Series: Full, 1-Hop and 2-Hop, each w/ Cosine and w/ Jaccard.

Figure 7: Effect of E_c Constraint on F-score



5 Case Studies 

In this section, we demonstrate the benefits of leveraging content information on two 
Wikipedia pages: "Machine Learning" and "Graph (Mathematics)". 

In the original network, "machine learning" has a total degree of 637, and many
neighbors (including "1-2-AX working memory task", "Wayne State University Computer
Science Department", "Chou-Fasman method", etc.) are at best peripheral to the
context of machine learning. When we sample the graph according to its link structure
only, 119 neighbors are retained for "machine learning". Although this eliminates
some noise, many noisy neighbors, including the three entries above, are still preserved.
Moreover, the process also removes many neighbors which should have been kept,
e.g. "naive Bayes classifier", "support vector machine", and so on.

The CODICIL framework, in contrast, alleviates both problems. Apart from removing
noisy edges, it also keeps the most relevant ones. For example, "AdaBoost",
"ensemble learning", "pattern recognition" all appear in "machine learning"'s
neighborhood in the sampled edge set E_sample. Perhaps more interestingly, we find that
CODICIL adds "neural network", an edge absent from the original network, into E_sample
(recall that it is possible for CODICIL to include an edge even if it is not in the
original graph, given its content similarity is sufficiently high). This again
illustrates the core philosophy of CODICIL: to complement the original network with
content information so as to better recover the community structure.

Similar observations can be made on the "Graph (Mathematics)" page. For example,
CODICIL removes entries including "Eric W. Weisstein", "gadget (computer
science)" and "interval chromatic number of an ordered graph". It also keeps "clique
(graph theory)", "Hamiltonian path", "connectivity (graph theory)" and others, which
would otherwise be removed if we sampled the graph using link structure alone.

6 Conclusion 

We have presented an efficient and extremely simple algorithm for community
identification in large-scale graphs by fusing content and link similarity. Our
algorithm, CODICIL, selectively retains edges of high relevancy within local
neighborhoods from the fused graph, and subsequently clusters this backbone graph with
any content-agnostic graph clustering algorithm.

Our experiments demonstrate that CODICIL outperforms state-of-the-art methods
in clustering quality while running orders of magnitude faster for moderately-sized
datasets, and can efficiently handle large graphs with millions of nodes and hundreds
of millions of edges. While simplification can be applied to the original topology alone
with a small loss of clustering quality, it is particularly potent when combined with
content edges, delivering superior clustering quality with excellent runtime performance.